GENERATING STRUCTURED DRUM PATTERN
USING VARIATIONAL AUTOENCODER AND SELF-SIMILARITY MATRIX

Supplementary Audio Files and Code

I-CHIEH Wei1, Chih-Wei Wu2,3, Li Su1,2

1Institute of Information Science, Academia Sinica, Taiwan
2Netflix, Inc., USA
yinjyun_luo@mymail.sutd.edu.sg, kat_agres@ihpc.astar.edu.sg, dorien_herremans@sutd.edu.sg



Drum pattern generation is a task that focuses on the rhythmic aspect of music and aims at generating percussive sequences.


First, we present the audio resynthesis examples in this work.

French horn
The original audio waveform
The resynthesized audio waveform from the Mel-spectrogram using Griffin-Lim
Piano
The original audio waveform
The resynthesized audio waveform from the Mel-spectrogram using Griffin-Lim
Cello
The original audio waveform
The resynthesized audio waveform from the Mel-spectrogram using Griffin-Lim
Bassoon
The original audio waveform
The resynthesized audio waveform from the Mel-spectrogram using Griffin-Lim
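The Griffin-Lim resynthesis step above can be sketched as follows. This is a minimal illustration using SciPy's linear STFT rather than the Mel-spectrogram pipeline of the paper; the input magnitude spectrogram is assumed to come from `scipy.signal.stft` with the same parameters, and the iteration count is an arbitrary choice.

```python
import numpy as np
from scipy.signal import stft, istft

def griffin_lim(mag, n_iter=32, fs=16000, nperseg=512):
    """Estimate a waveform whose STFT magnitude approximates `mag`.

    mag: magnitude spectrogram of shape (freq_bins, frames), assumed
    produced by scipy.signal.stft with the same fs and nperseg.
    """
    rng = np.random.default_rng(0)
    # Start from random phase and iteratively enforce consistency
    # between the fixed magnitude and a valid STFT phase.
    phase = np.exp(2j * np.pi * rng.random(mag.shape))
    for _ in range(n_iter):
        _, x = istft(mag * phase, fs=fs, nperseg=nperseg)
        _, _, S = stft(x, fs=fs, nperseg=nperseg)
        phase = np.exp(1j * np.angle(S))
    _, x = istft(mag * phase, fs=fs, nperseg=nperseg)
    return x
```

Note that only the magnitude is kept from the model's output; the phase is discarded and re-estimated, which is why the resynthesized waveforms above differ audibly from the originals.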

Now we demonstrate the controllable sound synthesis. As described in Section 4.3 of the paper, we specify the target pitch y_p and instrument y_t, and sample the pitch code z_p and timbre code z_t from the conditional distributions p(z_p|y_p) and p(z_t|y_t), respectively, where p(z_p|y_p) = N(μ_{y_p}, diag(σ_{y_p})) and p(z_t|y_t) = N(μ_{y_t}, diag(σ_{y_t})). In the following demonstration, we specify the same pitches for all instruments, play the audio, and display the corresponding Mel-spectrograms.
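A minimal NumPy sketch of this conditional sampling, using randomly initialized per-class means and standard deviations in place of the learned Gaussian-mixture parameters (all array names and sizes here are hypothetical, not the paper's):

```python
import numpy as np

def sample_latents(pitch_id, instr_id, mu_p, sigma_p, mu_t, sigma_t, rng):
    """Draw z_p ~ N(mu_p[pitch_id], diag(sigma_p[pitch_id]^2)) and
    z_t ~ N(mu_t[instr_id], diag(sigma_t[instr_id]^2))."""
    z_p = mu_p[pitch_id] + sigma_p[pitch_id] * rng.standard_normal(mu_p.shape[1])
    z_t = mu_t[instr_id] + sigma_t[instr_id] * rng.standard_normal(mu_t.shape[1])
    # The decoder would receive the concatenation [z_p, z_t].
    return np.concatenate([z_p, z_t])

rng = np.random.default_rng(0)
n_pitches, n_instruments, dim = 82, 12, 16      # illustrative sizes
mu_p = rng.standard_normal((n_pitches, dim))
sigma_p = np.full((n_pitches, dim), 0.1)
mu_t = rng.standard_normal((n_instruments, dim))
sigma_t = np.full((n_instruments, dim), 0.1)
z = sample_latents(60, 3, mu_p, sigma_p, mu_t, sigma_t, rng)
```

Because each class-conditional distribution has a small diagonal covariance, sampling gives controlled variation around the class mean rather than a fixed output per (pitch, instrument) pair.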

English horn
French horn
Tenor Trombone
Trumpet
Piano
Violin
Cello
Saxophone
Bassoon
Clarinet
Flute
Oboe

Many-to-Many timbre transfer


In this section, we demonstrate the model's applicability in timbre transfer.

As described in Section 4.4 in the paper, we first infer z_p and z_t of the source input, and modify z_t (denoted as z_source) by:

z_transfer = z_source + α · μ_{source→target},

where μ_{source→target} = μ_target − μ_source, and α ∈ [0, 1]. We then synthesize the spectrogram by passing [z_p, z_transfer] to the decoder. See Fig. 4 for an illustration of transferring French horn to piano.

Note that, in practice, we do not need labels of the source instrument and pitch for timbre transfer, as the two variables are automatically inferred by q(z_p|X) and q(z_t|X), respectively. q(y_t|X) infers the mixture component (the source instrument identity) to which X belongs, and μ_{source→target} is then obtained by subtracting the mean of the source mixture component from that of the target.
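Given the inferred timbre code and the two mixture means, the transfer itself reduces to a vector interpolation. A minimal sketch (variable names are ours, not the paper's):

```python
import numpy as np

def transfer_timbre(z_source, mu_source, mu_target, alpha):
    """Shift the source timbre code toward the target mixture mean.

    alpha = 0 returns the source code unchanged; alpha = 1 applies
    the full mean difference (mu_target - mu_source).
    """
    return z_source + alpha * (mu_target - mu_source)
```

Sweeping α between 0 and 1 yields the gradual source-to-target morphs in the demonstrations that follow.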

Following Fig. 5 in the paper, we demonstrate Fhn→Pno, Pno→Vc, Vc→Bn, and Bn→Fhn.
The source instrument is gradually changed to the target instrument by α ∈ {0, 0.25, 0.5, 0.75, 1.0}.

French horn to piano

C2 mf Fhn
F#2 pp Fhn

Piano to cello


Notice that the model is able to generalize to the pitch G6, which is outside the playing range of the cello.

G6 pp Pno
D3 pp Pno

Cello to bassoon


F3 pp Vc
D#4 pp Vc

Bassoon to French horn


D#4 pp Bn
C5 pp Bn

Disentangling the spectral centroid


In this section, we present the effect of latent traversal along the 13th dimension of z_t, which is discussed in Section 4.5 of the paper.
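Such a traversal can be sketched as follows: hold every other dimension of z_t fixed and sweep a single dimension (here index 12, i.e. the 13th dimension; the code size and value range are illustrative assumptions, not the paper's settings), then decode each variant.

```python
import numpy as np

def traverse_dimension(z_t, dim, values):
    """Return one copy of z_t per value, differing only in `dim`."""
    out = np.tile(z_t, (len(values), 1))
    out[:, dim] = values
    return out

z_t = np.zeros(16)                        # illustrative timbre code
sweep = np.linspace(-3.0, 3.0, 5)         # arbitrary traversal range
variants = traverse_dimension(z_t, 12, sweep)
# Each row of `variants` would be paired with z_p and decoded,
# producing the spectral-centroid changes demonstrated below.
```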

Ehn-B3-mf
Trop-D5-mf
Pno-C#6-mf
Vn-A4-mf
Bn-A3-mf
Ob-F6-mf